Jessica Wells and Jason Gerstenberger (Advisor: Dr. Cohen)
Introduction
Background
Restricted Boltzmann Machines (RBMs) are a type of neural network that has been around since the 1980s. RBMs are primarily used for unsupervised learning tasks like dimensionality reduction and feature extraction, which help prepare datasets for machine learning models that may later be trained using supervised learning.
Like Hopfield networks, Boltzmann machines are undirected graphical models, but they differ in that they are stochastic and can have hidden units. Both models are energy-based, meaning they learn by minimizing an energy function (Smolensky et al. 1986). Boltzmann machines use a sigmoid activation function, which makes the model probabilistic.
In the “Restricted” Boltzmann Machine, there are no interactions between neurons in the visible layer or between neurons in the hidden layer, creating a bipartite graph of neurons. Below is a diagram taken from Goodfellow, et al. (Goodfellow, Bengio, and Courville 2016) (p. 577) for visualization of the connections.
More on Restricted Boltzmann Machines
Goodfellow et al. discuss the expense of drawing samples from most undirected graphical models; the RBM, however, allows for block Gibbs sampling (p. 578), where the network alternates between sampling all hidden units simultaneously and sampling all visible units simultaneously. Derivatives are also simplified by the fact that the energy function of the RBM is a linear function of its parameters, as will be seen in Methods.
RBMs are trained using a process called Contrastive Divergence (CD) (Hinton 2002) where the weights are updated to minimize the difference between samples from the data and samples from the model. Learning rate, batch size, and number of hidden units are all hyperparameters that can affect the ability of the training to converge successfully and learn the underlying structure of the data.
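The CD-1 update described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up toy dimensions, not our actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 step for a binary RBM on a batch of visible vectors v0."""
    # Positive phase: sample hidden units given the data
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct visibles, then recompute hidden probabilities
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Move weights toward data statistics, away from model statistics
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy batch: 8 samples, 6 visible units, 4 hidden units
v = rng.integers(0, 2, size=(8, 6)).astype(float)
W = rng.normal(0, 0.01, size=(6, 4))
b, c = np.zeros(6), np.zeros(4)
W, b, c = cd1_update(v, W, b, c)
```

The update direction is the difference between the data-driven and model-driven correlations, which is exactly the "minimize the difference between samples from the data and samples from the model" idea.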
Applications
RBMs are probably best known for their success in collaborative filtering. The RBM model was used in the Netflix Prize competition to predict user ratings for movies, with the result that it outperformed the Singular Value Decomposition (SVD) method that was state-of-the-art at the time (Salakhutdinov, Mnih, and Hinton 2007). They have also been trained to recognize handwritten digits, such as the MNIST dataset (Hinton 2002).
RBMs have been successfully used to distinguish normal from anomalous network traffic, and their potential for improving network security is promising. Progress in network anomaly detection is slow due to the difficulty of obtaining datasets for training and testing, as clients are often reluctant to divulge information that could potentially harm their networks. In a real-life dataset where one host had normal traffic and one was infected by a bot, a discriminative RBM (DRBM) successfully distinguished the normal from the anomalous traffic. The DRBM does not rely on knowing the data distribution ahead of time, which is useful, but this also makes it prone to overfitting: when the same trained model was applied to the KDD '99 training dataset, performance declined (Fiore et al. 2013).
RBMs can provide greatly improved classification of brain disorders in MRI images. Generative Adversarial Networks (GANs) use two neural networks: a generator, which generates fake data, and a discriminator, which tries to distinguish between real and fake data. Loss from the discriminator is backpropagated through the generator so that both parts are trained simultaneously. The RBM-GAN uses RBM features from real MRI images as inputs to the generator. Features from the discriminator are then used as inputs to a classifier (Aslan, Dogan, and Koca 2023).
The many-body quantum wavefunction, which describes the quantum state of a system of particles, is difficult to compute with classical computers. RBMs have been used to approximate it using variational Monte Carlo methods (Melko et al. 2019).
RBMs are notoriously slow to train because computing the activation probabilities requires the calculation of vector dot products. Lean Contrastive Divergence (LCD) is a method that adds two techniques to speed up RBM training. The first is bounds-based filtering, where upper and lower bounds on the activation probability select only a range of dot products to perform. The second is the delta product, which recalculates only the changed portions of the vector dot product (Ning, Pittman, and Shen 2018).
Methods
Below is the energy function of the RBM.
\[
E(v,h) = - \sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i} \sum_{j} v_i w_{i,j} h_j
\qquad(1)\] where \(v_i\) and \(h_j\) represent visible and hidden units; \(a_i\) and \(b_j\) are the bias terms of the visible and hidden units; and each weight \(w_{i,j}\) represents the interaction between visible unit \(i\) and hidden unit \(j\) (Fischer and Igel 2012).
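Equation 1 can be evaluated directly. Below is a small NumPy sketch with arbitrary toy values (not parameters from our trained model):

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """E(v, h) = -a.v - b.h - v W h, matching Equation 1."""
    return -a @ v - b @ h - v @ W @ h

# Toy configuration: 3 visible units, 2 hidden units
v = np.array([1.0, 0.0, 1.0])
h = np.array([1.0, 1.0])
a = np.array([0.1, 0.2, 0.3])   # visible biases
b = np.array([0.5, -0.5])       # hidden biases
W = np.full((3, 2), 0.25)       # visible-hidden weights

print(rbm_energy(v, h, a, b, W))  # -1.4
```

Lower energy corresponds to configurations the model considers more probable.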
Background on Models for Classification Task
We train Logistic Regression (with and without RBM features as input), a Feed Forward Network (with and without RBM features as input), and a Convolutional Neural Network. Below is a brief reminder of the basics of each model.
For the models incorporating the RBM, we take the Fashion MNIST features/pixels and train the RBM (unsupervised learning) to extract hidden features from the visible layer, and then feed these features into either logistic regression or the feed forward network. We then use the trained model to predict labels for the test data, evaluating how well the RBM-derived features perform in a supervised classification task.
1. Logistic Regression
Mathematically, the concept behind binary logistic regression is the logit (the natural logarithm of an odds ratio)(Peng, Lee, and Ingersoll 2002). However, since we have 10 labels, our classification task falls into “Multinomial Logistic Regression.”
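In the multinomial case, the single logit generalizes to a softmax over the 10 class scores, and the predicted class is the one with the highest probability. A minimal sketch (the score vector here is made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical class scores (e.g., x @ W + b) for the 10 Fashion MNIST classes
scores = np.array([2.0, 0.5, 0.1, -1.0, 0.0, 0.3, 1.2, -0.5, 0.8, 0.1])
probs = softmax(scores)
print(probs.argmax())  # 0  (the class with the largest score)
```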
2. Feed Forward Network
The feed forward network (FNN) is one where information flows in one direction from input to output with no loops or feedback. There can be zero hidden layers in between (called a single-layer FNN) or one or more hidden layers (multilayer FNN) (Sazlı 2006).
3. Convolutional Neural Network
The convolutional neural network (CNN) is a type of feed forward network except that unlike the traditional ANN, CNNs are primarily used for pattern recognition with images (O’Shea and Nash 2015). The CNN has 3 layers which are stacked to form the full CNN: convolutional, pooling, and fully-connected layers.
Creating the RBM
Below is our process for creating the RBM:
Step 1: We first initialize the RBM with random weights and biases, with 784 visible units and 256 hidden units. We also set the number of contrastive divergence steps (k) to 1.
Step 2: Sample hidden units from visible. The math behind computing the hidden unit activations from the given input can be seen in Equation 3 (Fischer and Igel 2012), where the probability is used to sample from the Bernoulli distribution. \[
p(H_i = 1 | \mathbf{v}) = \sigma \left( \sum_{j=1}^{m} w_{ij} v_j + c_i \right)
\qquad(3)\]
where \(p(\cdot)\) is the probability of the \(i\)th hidden unit being activated (\(=1\)) given the visible input vector; \(\sigma\) is the sigmoid activation function (below), which maps the weighted sum to a probability between 0 and 1; \(m\) is the number of visible units; \(w_{ij}\) is the weight connecting visible unit \(j\) to hidden unit \(i\); \(v_j\) is the value of the \(j\)th visible unit; and \(c_i\) is the bias term for the \(i\)th hidden unit. \[
\sigma(x) = \frac{1}{1 + e^{-x}}
\]
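Equation 3 and the sigmoid can be sketched together in NumPy. The dimensions and weights below are toy values for illustration, not our 784×256 RBM:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, c):
    """Equation 3: p(H_i = 1 | v) = sigma(sum_j w_ij v_j + c_i), then a Bernoulli sample."""
    p = sigmoid(W @ v + c)  # W has shape (hidden, visible)
    h = (rng.random(p.shape) < p).astype(float)
    return p, h

# Toy RBM: 4 hidden units, 6 visible units
W = rng.normal(0, 0.1, size=(4, 6))
c = np.zeros(4)
v = rng.integers(0, 2, size=6).astype(float)
p, h = sample_hidden(v, W, c)
```

Sampling visible units from hidden (Equation 4, next step) is the mirror image, using the transposed weights and the visible biases.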
Step 3: Sample visible units from hidden. The math behind computing visible unit activations from the hidden layer can be seen in Equation 4 (Fischer and Igel 2012). Visible states are sampled using the Bernoulli distribution. This way we can see how well the RBM learned from the inputs. \[
p(V_j = 1 | \mathbf{h}) = \sigma \left( \sum_{i=1}^{n} w_{ij} h_i + b_j \right)
\qquad(4)\]
where \(p(\cdot)\) is the probability of the \(j\)th visible unit being activated (\(=1\)) given the hidden vector \(\mathbf{h}\); \(\sigma\) is the same as above; \(n\) is the number of hidden units; \(w_{ij}\) is the weight connecting hidden unit \(i\) to visible unit \(j\); and \(b_j\) is the bias term for the \(j\)th visible unit.
Step 4: k = 1 steps of Contrastive Divergence (feed forward, feed backward), which executes steps 2 and 3. Contrastive Divergence updates the RBM's weights by minimizing the difference between the original input and the reconstructed input created by the RBM.
Step 5: Free energy is computed. The free energy F is given by the negative logarithm of the partition function Z (Oh, Baggag, and Nha 2020), where the partition function is \[
Z(\theta) \equiv \sum_{v,h} e^{-E(v,h; \theta)}
\qquad(5)\] and the free energy function is \[
F(\theta) = -\ln Z(\theta)
\qquad(6)\] where lower free energy means the RBM learned the visible state well.
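For an RBM small enough to enumerate, Equations 5 and 6 can be computed by brute force over all binary configurations. A sketch with all parameters set to zero, so the result is easy to verify by hand (every configuration has energy 0, so \(Z = 2^{n_v} \cdot 2^{n_h}\)):

```python
import itertools
import numpy as np

def energy(v, h, a, b, W):
    """E(v, h) from Equation 1."""
    return -a @ v - b @ h - v @ W @ h

def free_energy_brute_force(a, b, W):
    """F = -ln Z, with Z summed over all binary (v, h) pairs. Tiny RBMs only:
    the sum has 2^(nv + nh) terms, which is why Z is intractable in practice."""
    nv, nh = len(a), len(b)
    Z = 0.0
    for v in itertools.product([0.0, 1.0], repeat=nv):
        for h in itertools.product([0.0, 1.0], repeat=nh):
            Z += np.exp(-energy(np.array(v), np.array(h), a, b, W))
    return -np.log(Z)

# 2 visible, 2 hidden, all parameters zero -> Z = 16, F = -ln 16
a, b, W = np.zeros(2), np.zeros(2), np.zeros((2, 2))
print(free_energy_brute_force(a, b, W))
```

The exponential cost of this enumeration is exactly why training uses Contrastive Divergence instead of the exact gradient.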
Step 6: Train the RBM. Model weights are updated via gradient descent.
Step 7: Feature extraction for classification. The hidden layer activations of the RBM are used as features for Logistic Regression and the Feed Forward Network.
Hyperparameter Tuning
We use the Tree-structured Parzen Estimator algorithm from Optuna (Akiba et al. 2019) to tune the hyperparameters of the RBM and the classifier models, and we use MLFlow (Zaharia et al. 2018) to record and visualize the results of the hyperparameter tuning process. The hyperparameters we tune include the learning rate, batch size, number of hidden units, and number of epochs.
Metrics Used
1. Accuracy
Accuracy is defined as the number of correct classifications divided by the total number of classifications: \[
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
\]
2. Macro F1 Score
The macro F1 score is the unweighted average of the individual F1 scores of each class. It does not account for class imbalance; however, we saw earlier that the classes are balanced in Fashion MNIST. The F1 score for each individual class is as follows: \[
\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\] where precision for each class is \[
\text{Precision} = \frac{TP}{TP + FP}
\] and recall for each class is \[
\text{Recall} = \frac{TP}{TP + FN}
\] The definitions of these terms for multiclass problems are more complicated than in the binary case and are best displayed with examples.

| Acronym | Example for a trouser image |
| --- | --- |
| TP = True Positives | the image is a trouser and the model predicts a trouser |
| TN = True Negatives | the image is not a trouser and the model predicts anything but trouser |
| FP = False Positives | the image is anything but trouser but the model predicts trouser |
| FN = False Negatives | the image is a trouser and the model predicts another class (like shirt) |
As stated earlier, the individual F1 scores for each class are taken and averaged to compute the Macro F1 score in a multiclass problem like Fashion MNIST.
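The macro F1 computation can be verified by hand on a tiny toy example (the labels below are made up for illustration); this mirrors what sklearn's f1_score with average="macro" computes:

```python
import numpy as np

# Toy 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

f1s = []
for cls in np.unique(y_true):
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1s.append(2 * precision * recall / (precision + recall))

macro_f1 = float(np.mean(f1s))  # unweighted average over classes
print(round(macro_f1, 4))  # 0.6556
```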
Analysis and Results
Data Exploration and Visualization
We use the Fashion MNIST dataset from Zalando Research (Xiao, Rasul, and Vollgraf 2017). The set includes 70,000 grayscale images of clothing items: 60,000 for training and 10,000 for testing. Each image is 28x28 pixels (784 pixels total). Each pixel has a whole-number value ranging from 0 (white) to 255 (very dark). There are 785 columns in total, as one column is dedicated to the label.
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import torch
import torchvision.datasets
import torchvision.models
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

train_data = torchvision.datasets.FashionMNIST(
    root="./data",
    train=True,
    download=True,
    transform=transforms.ToTensor()  # Converts to tensor but does NOT normalize
)
test_data = torchvision.datasets.FashionMNIST(
    root="./data",
    train=False,
    download=True,
    transform=transforms.ToTensor()
)
Get the seventh image to show a sample
# Extract the seventh image (index 6)
image_tensor, label = train_data[6]  # shape: [1, 28, 28]

# Convert tensor to NumPy array
image_array = image_tensor.numpy().squeeze()

# Plot the image
plt.figure(figsize=(5, 5))
plt.imshow(image_array, cmap="gray")
plt.title(f"FashionMNIST Image (Label: {label})")
plt.axis("off")  # Hide axes
plt.show()
t-SNE Visualization t-distributed Stochastic Neighbor Embedding (t-SNE) is used here to visualize the separation between classes in a high-dimensional dataset. Each point represents a single fashion item (e.g., T-shirt, Trouser, etc.), and the color corresponds to its true label across the 10 categories listed above.
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Run t-SNE to reduce dimensionality
tsne = TSNE(n_jobs=-1, random_state=42)  # -1 uses all available cores
embeddings = tsne.fit_transform(X)

# Create scatter plot
figure = plt.figure(figsize=(15, 7))
plt.scatter(embeddings[:, 0], embeddings[:, 1], c=train_labels,
            cmap=plt.cm.get_cmap("jet", 10), marker='.')
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
plt.title("t-SNE Visualization of Fashion MNIST")
plt.show()
What the visualization shows: Class 1 (blue / Trousers) forms a clearly distinct and tightly packed cluster, indicating that the pixel patterns for trousers are less similar to those of other classes. In contrast, Classes 4 (Coat), 6 (Shirt), and 2 (Pullover) show significant overlap, suggesting that these clothing items are harder to distinguish visually and may lead to more confusion during classification.
Modeling and Results
Our Goal We are classifying Fashion MNIST images into one of 10 categories. To evaluate performance, we’re comparing five different models — some trained on raw pixel values and others using features extracted by a Restricted Boltzmann Machine (RBM). Our objective is to assess whether incorporating RBM into the workflow improves classification accuracy compared to using raw image data alone.
Our Models
1. Logistic Regression on Fashion MNIST Data
2. Feed Forward Network on Fashion MNIST Data
3. Convolutional Neural Network on Fashion MNIST Data
4. Logistic Regression on RBM Hidden Features (of Fashion MNIST Data)
5. Feed Forward Network on RBM Hidden Features (of Fashion MNIST Data)
Note: Outputs (50 trials) and code are shown below for each model.
Import Libraries and Re-load data for first 3 models
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import numpy as np
import mlflow
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader

# Set device (Apple Silicon GPU)
device = torch.device("mps")

# Load Fashion-MNIST dataset again for the first 3 models
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)
Model 1
Model 1: Logistic Regression on Fashion MNIST Data
from sklearn.metrics import f1_score

CLASSIFIER = "LogisticRegression"  # Change for FNN, LogisticRegression, or CNN

# Define CNN model
class FashionCNN(nn.Module):
    def __init__(self, filters1, filters2, kernel1, kernel2):
        super(FashionCNN, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=filters1, kernel_size=kernel1, padding=1),
            nn.BatchNorm2d(filters1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(in_channels=filters1, out_channels=filters2, kernel_size=kernel2),
            nn.BatchNorm2d(filters2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc1 = None  # initialize first fully connected layer as None, defined later in forward
        self.drop = nn.Dropout2d(0.25)
        self.fc2 = nn.Linear(in_features=600, out_features=120)
        self.fc3 = nn.Linear(in_features=120, out_features=10)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        # Flatten tensor dynamically, preserving batch size
        out = out.view(out.size(0), -1)
        if self.fc1 is None:
            self.fc1 = nn.Linear(out.shape[1], 600).to(x.device)
        out = self.fc1(out)
        out = self.drop(out)
        out = self.fc2(out)
        out = self.fc3(out)
        return out

# Define Optuna objective function
def objective(trial):
    # Set MLflow experiment name
    if CLASSIFIER == "LogisticRegression":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-noRBM")
    elif CLASSIFIER == "FNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-noRBM")
    elif CLASSIFIER == "CNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-cnn-noRBM")

    batch_size = trial.suggest_int("batch_size", 64, 256, step=32)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    mlflow.start_run(experiment_id=experiment.experiment_id)
    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5)
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    if CLASSIFIER == "FNN":
        hidden_size = trial.suggest_int("fnn_hidden", 192, 384)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)
        mlflow.log_param("classifier", "FNN")
        mlflow.log_param("fnn_hidden", hidden_size)
        mlflow.log_param("learning_rate", learning_rate)
        model = nn.Sequential(
            nn.Linear(784, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        ).to(device)
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    elif CLASSIFIER == "CNN":
        filters1 = trial.suggest_int("filters1", 16, 64, step=16)
        filters2 = trial.suggest_int("filters2", 32, 128, step=32)
        kernel1 = trial.suggest_int("kernel1", 3, 5)
        kernel2 = trial.suggest_int("kernel2", 3, 5)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)
        mlflow.log_param("classifier", "CNN")
        mlflow.log_param("filters1", filters1)
        mlflow.log_param("filters2", filters2)
        mlflow.log_param("kernel1", kernel1)
        mlflow.log_param("kernel2", kernel2)
        mlflow.log_param("learning_rate", learning_rate)
        model = FashionCNN(filters1, filters2, kernel1, kernel2).to(device)
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    elif CLASSIFIER == "LogisticRegression":
        mlflow.log_param("classifier", "LogisticRegression")
        # Prepare data for Logistic Regression (flatten 28x28 images to 784 features)
        train_features = train_dataset.data.view(-1, 784).numpy()
        train_labels = train_dataset.targets.numpy()
        test_features = test_dataset.data.view(-1, 784).numpy()
        test_labels = test_dataset.targets.numpy()
        # Normalize the pixel values to [0, 1] for better convergence
        train_features = train_features / 255.0
        test_features = test_features / 255.0
        C = trial.suggest_float("C", 0.01, 10.0, log=True)
        solver = "saga"
        model = LogisticRegression(C=C, max_iter=num_classifier_epochs, solver=solver)
        model.fit(train_features, train_labels)
        predictions = model.predict(test_features)
        accuracy = accuracy_score(test_labels, predictions) * 100
        macro_f1 = f1_score(test_labels, predictions, average="macro")
        print(f"Logistic Regression Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}")
        mlflow.log_param("C", C)
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1)
        mlflow.end_run()
        return accuracy

    # Training loop for FNN and CNN
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(num_classifier_epochs):
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))
            optimizer.zero_grad()
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"{CLASSIFIER} Epoch {epoch+1}: loss = {running_loss / len(train_loader):.4f}")

    # Model evaluation
    model.eval()
    correct, total = 0, 0
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    accuracy = 100 * correct / total
    macro_f1 = f1_score(all_labels, all_preds, average="macro")
    print(f"Test Accuracy: {accuracy:.2f}%")
    print(f"Macro F1 Score: {macro_f1:.4f}")
    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("macro_f1", macro_f1)
    mlflow.end_run()
    return accuracy

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1)  # n_trials set to 1 for quick rendering
    print(f"Best Parameters for {CLASSIFIER}:", study.best_params)
    print("Best Accuracy:", study.best_value)
Logistic Regression Test Accuracy: 84.48%
Macro F1 Score: 0.8434
Best Parameters for LogisticRegression: {'batch_size': 64, 'num_classifier_epochs': 5, 'C': 0.8871749041881961}
Best Accuracy: 84.48
Test Accuracy of Logistic Regression by C (inverse regularization strength)
\[
C = \frac{1}{\lambda} \quad \text{(inverse regularization strength)}
\]
Lower values of C mean more regularization (higher penalties for larger weight coefficients)
What the plot shows: Most Optuna trials sampled lower values of C, so the optimization favors stronger regularization. This is further evidenced by the clustering of higher accuracies at lower values of C. A possible anomaly is seen at C = 10 with fairly high accuracy; however, it is still not higher than the accuracies at lower values of C.
Model 2
Model 2: Feed Forward Network on Fashion MNIST Data
The code for Model 2 is identical to the Model 1 code above, except that CLASSIFIER = "FNN" selects the feed forward branch.
FNN Epoch 1: loss = 0.5424
FNN Epoch 2: loss = 0.3883
FNN Epoch 3: loss = 0.3464
FNN Epoch 4: loss = 0.3164
FNN Epoch 5: loss = 0.2988
Test Accuracy: 87.03%
Macro F1 Score: 0.8719
Best Parameters for FNN: {'batch_size': 224, 'num_classifier_epochs': 5, 'fnn_hidden': 317, 'learning_rate': 0.0018773107732941577}
Best Accuracy: 87.03
Test Accuracy by FNN Hidden Units
What the plot shows: Higher numbers of hidden units in the feedforward network were sampled more frequently by Optuna, suggesting a preference for more complex models. However, test accuracy appears to level off between 300 and 375 hidden units, suggesting model complexity reached its optimal range; further increases in hidden units would likely not yield higher accuracy.
Model 3
Model 3: Convolutional Neural Network on Fashion MNIST Data
Base code for the CNN structure was borrowed from Kaggle.
The code for Model 3 is identical to the Model 1 code above, except that CLASSIFIER = "CNN" selects the convolutional branch.
CNN Epoch 1: loss = 0.4298
CNN Epoch 2: loss = 0.3068
CNN Epoch 3: loss = 0.2720
CNN Epoch 4: loss = 0.2511
CNN Epoch 5: loss = 0.2350
Test Accuracy: 89.36%
Macro F1 Score: 0.8940
Best Parameters for CNN: {'batch_size': 192, 'num_classifier_epochs': 5, 'filters1': 32, 'filters2': 96, 'kernel1': 5, 'kernel2': 4, 'learning_rate': 0.0020717573771718055}
Best Accuracy: 89.36
Test Accuracy Based on the Number of Filters in the First Conv2D Layer
What the plot shows: Although the highest test accuracy was achieved with 64 filters in the first convolutional layer, the number of filters alone is not a strong predictor of model performance. Accuracies are spread widely (high variance) at each filter count, and they are well distributed across the different counts, suggesting that other hyperparameters play a bigger role in determining accuracy.
Test Accuracy Based on the Number of Filters in the Second Conv2D Layer
What the plot shows: As with the first Conv2D layer, the number of filters does not appear to be a strong predictor of accuracy. Optuna sampled higher filter counts more frequently, up to 128 for this second layer, suggesting that larger counts tended to perform better; still, there is high variance in accuracy at each filter count.
Test Accuracy Based on Kernel Size in the First Conv2D Layer
What the plot shows: A kernel size of 3 was sampled more frequently by Optuna and yielded higher accuracies than kernel sizes of 4 or 5.
Test Accuracy Based on Kernel Size in the Second Conv2D Layer
What the plot shows: As with the first Conv2D layer, a kernel size of 3 was strongly favored by Optuna and consistently led to higher test accuracies.
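The per-hyperparameter summaries behind plots like these can be computed directly from the trial results. Below is a minimal sketch using made-up `(kernel1, test_accuracy)` pairs purely for illustration; in practice the real values come from Optuna's `study.trials_dataframe()` after `study.optimize()`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (kernel1, test_accuracy) pairs standing in for real Optuna trials
trials = [(3, 90.8), (3, 91.2), (4, 89.5), (4, 90.1), (5, 88.9), (5, 89.4)]

# Group trial accuracies by kernel size
by_kernel = defaultdict(list)
for kernel, acc in trials:
    by_kernel[kernel].append(acc)

# Mean test accuracy per kernel size, the kind of summary the plots above visualize
summary = {k: round(mean(v), 2) for k, v in sorted(by_kernel.items())}
print(summary)  # {3: 91.0, 4: 89.8, 5: 89.15}
```

The same grouping extends to any logged hyperparameter (filter counts, learning rate bins, and so on).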
Model 4
Model 4: Logistic Regression on RBM Hidden Features (of Fashion MNIST Data)
from sklearn.metrics import accuracy_score, f1_score

CLASSIFIER = 'LogisticRegression'

if CLASSIFIER == 'LogisticRegression':
    experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-withrbm")
else:
    experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-withrbm")

class RBM(nn.Module):
    def __init__(self, n_visible=784, n_hidden=256, k=1):
        super(RBM, self).__init__()
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # Initialize weights and biases
        self.W = nn.Parameter(torch.randn(n_hidden, n_visible) * 0.1)
        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))
        self.k = k  # CD-k steps

    def sample_h(self, v):
        # Given visible v, sample hidden h
        p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias))  # p(h=1|v)
        h_sample = torch.bernoulli(p_h)  # sample Bernoulli
        return p_h, h_sample

    def sample_v(self, h):
        # Given hidden h, sample visible v
        p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))  # p(v=1|h)
        v_sample = torch.bernoulli(p_v)
        return p_v, v_sample

    def forward(self, v):
        # Perform k steps of contrastive divergence starting from v
        v_k = v.clone()
        for _ in range(self.k):
            _, h_k = self.sample_h(v_k)  # sample hidden from current visible
            _, v_k = self.sample_v(h_k)  # sample visible from hidden
        return v_k  # k-step reconstructed visible

    def free_energy(self, v):
        # Compute the visible bias term for each sample in the batch
        vbias_term = (v * self.v_bias).sum(dim=1)  # shape: [batch_size]
        # Compute the activation of the hidden units
        wx_b = F.linear(v, self.W, self.h_bias)  # shape: [batch_size, n_hidden]
        # Compute the hidden term
        hidden_term = torch.sum(torch.log1p(torch.exp(wx_b)), dim=1)  # shape: [batch_size]
        # Return the mean free energy over the batch
        return -(vbias_term + hidden_term).mean()

transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)

def objective(trial):
    num_rbm_epochs = trial.suggest_int("num_rbm_epochs", 5, 5)  # 24, 33)
    batch_size = trial.suggest_int("batch_size", 192, 1024)
    rbm_lr = trial.suggest_float("rbm_lr", 0.05, 0.1)
    rbm_hidden = trial.suggest_int("rbm_hidden", 384, 8192)
    mlflow.start_run(experiment_id=experiment.experiment_id)
    if CLASSIFIER != 'LogisticRegression':
        fnn_hidden = trial.suggest_int("fnn_hidden", 192, 384)
        fnn_lr = trial.suggest_float("fnn_lr", 0.0001, 0.0025)
        mlflow.log_param("fnn_hidden", fnn_hidden)
        mlflow.log_param("fnn_lr", fnn_lr)
    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5)  # 40, 60)
    mlflow.log_param("num_rbm_epochs", num_rbm_epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("rbm_lr", rbm_lr)
    mlflow.log_param("rbm_hidden", rbm_hidden)
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    # Instantiate RBM and optimizer
    device = torch.device("mps")
    rbm = RBM(n_visible=784, n_hidden=rbm_hidden, k=1).to(device)
    optimizer = torch.optim.SGD(rbm.parameters(), lr=rbm_lr)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    rbm_training_failed = False
    # Training loop (assuming train_loader yields batches of images and labels)
    for epoch in range(num_rbm_epochs):
        total_loss = 0.0
        for images, _ in train_loader:
            # Flatten images and binarize
            v0 = images.view(-1, 784).to(rbm.W.device)  # shape [batch_size, 784]
            v0 = torch.bernoulli(v0)  # sample binary input
            vk = rbm(v0)  # k-step CD reconstruction
            # Compute contrastive divergence loss (free energy difference)
            loss = rbm.free_energy(v0) - rbm.free_energy(vk)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1}: avg free-energy loss = {total_loss/len(train_loader):.4f}")
        if np.isnan(total_loss):
            rbm_training_failed = True
            break

    if rbm_training_failed:
        accuracy = 0.0
        macro_f1 = 0.0
        print("RBM training failed — returning 0.0 for accuracy and macro F1")
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1)
        mlflow.set_tag("status", "rbm_failed")  # Optional tag
        mlflow.end_run()
        return float(accuracy)
    else:
        rbm.eval()  # set evaluation mode in case any layers behave differently in training
        features_list = []
        labels_list = []
        for images, labels in train_loader:
            v = images.view(-1, 784).to(rbm.W.device)
            # (optionally binarize v here, or use raw normalized pixels)
            h_prob, h_sample = rbm.sample_h(v)  # get hidden activations
            features_list.append(h_prob.cpu().detach().numpy())
            labels_list.append(labels.numpy())
        train_features = np.concatenate(features_list)  # shape: [N_train, n_hidden]
        train_labels = np.concatenate(labels_list)

        # Convert pre-extracted training features and labels to tensors and create a DataLoader
        train_features_tensor = torch.tensor(train_features, dtype=torch.float32)
        train_labels_tensor = torch.tensor(train_labels, dtype=torch.long)
        train_feature_dataset = torch.utils.data.TensorDataset(train_features_tensor, train_labels_tensor)
        train_feature_loader = torch.utils.data.DataLoader(train_feature_dataset, batch_size=batch_size, shuffle=True)

        if CLASSIFIER == 'LogisticRegression':
            # Optuna tuning, same as logistic regression without RBM features
            lr_C = trial.suggest_float("lr_C", 0.01, 10.0, log=True)
            mlflow.log_param("lr_C", lr_C)  # Log the chosen C value
            classifier = LogisticRegression(max_iter=num_classifier_epochs, C=lr_C, solver="saga")
            classifier.fit(train_features, train_labels)
        else:
            classifier = nn.Sequential(
                nn.Linear(rbm.n_hidden, fnn_hidden),
                nn.ReLU(),
                nn.Linear(fnn_hidden, 10)
            )
            # Move classifier to the same device as the RBM
            classifier = classifier.to(device)
            criterion = nn.CrossEntropyLoss()
            classifier_optimizer = torch.optim.Adam(classifier.parameters(), lr=fnn_lr)
            classifier.train()
            for epoch in range(num_classifier_epochs):
                running_loss = 0.0
                for features, labels in train_feature_loader:
                    features = features.to(device)
                    labels = labels.to(device)
                    # Forward pass through classifier
                    outputs = classifier(features)
                    loss = criterion(outputs, labels)
                    # Backpropagation and optimization
                    classifier_optimizer.zero_grad()
                    loss.backward()
                    classifier_optimizer.step()
                    running_loss += loss.item()
                avg_loss = running_loss / len(train_feature_loader)
                print(f"Classifier Epoch {epoch+1}: loss = {avg_loss:.4f}")

        # Evaluate the classifier on test data.
        # Here we extract features from the RBM for each test image.
        if CLASSIFIER != 'LogisticRegression':
            classifier.eval()
        correct = 0
        total = 0
        features_list = []
        labels_list = []
        with torch.no_grad():
            for images, labels in test_loader:
                v = images.view(-1, 784).to(device)
                # Extract hidden activations; you can use either h_prob or h_sample.
                h_prob, _ = rbm.sample_h(v)
                if CLASSIFIER == 'LogisticRegression':
                    features_list.append(h_prob.cpu().detach().numpy())
                    labels_list.append(labels.numpy())
                else:
                    outputs = classifier(h_prob)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted.cpu() == labels).sum().item()

        if CLASSIFIER == 'LogisticRegression':
            test_features = np.concatenate(features_list)
            test_labels = np.concatenate(labels_list)
            predictions = classifier.predict(test_features)
            accuracy = accuracy_score(test_labels, predictions) * 100
            macro_f1 = f1_score(test_labels, predictions, average="macro")
        else:
            accuracy = 100 * correct / total
            all_preds = []
            all_labels = []
            classifier.eval()
            with torch.no_grad():
                for images, labels in test_loader:
                    v = images.view(-1, 784).to(device)
                    h_prob, _ = rbm.sample_h(v)
                    outputs = classifier(h_prob)
                    _, predicted = torch.max(outputs.data, 1)
                    all_preds.extend(predicted.cpu().numpy())
                    all_labels.extend(labels.numpy())
            macro_f1 = f1_score(all_labels, all_preds, average="macro")

        print(f"Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}")
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1)
        mlflow.end_run()
        return float(accuracy if accuracy is not None else 0.0)

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1)  # n_trials set to 1 for quick rendering
    print(study.best_params)
    print(study.best_value)
    print(study.best_trial)
Epoch 1: avg free-energy loss = 33.0973
Epoch 2: avg free-energy loss = 4.4396
Epoch 3: avg free-energy loss = 2.0698
Epoch 4: avg free-energy loss = 0.8885
Epoch 5: avg free-energy loss = 0.2602
Test Accuracy of Logistic Regression on RBM Hidden Features by Inverse Regularization Strength
What the plot shows: When using RBM-extracted hidden features as input to logistic regression, the inverse regularization strength does not appear to be a strong predictor of test accuracy.
Test Accuracy By Number of RBM Hidden Units
What the plot shows: Optuna slightly favors a higher number of hidden units in the RBM, with a peak at 5340 (and similar peaks at 5358, 5341, etc.). Beyond roughly 7000 units, however, accuracy appears to decline, suggesting the optimum number of units lies near the 5300 mark.
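One caveat worth noting about the RBM cell above: its `free_energy` computes the hidden term as `log1p(exp(wx_b))`, which overflows to infinity for large pre-activations and can contribute to the NaN failures the code guards against. A hedged NumPy sketch (illustrative names, not the project's code) shows the same quantity computed with the numerically stable `logaddexp` identity:

```python
import numpy as np

def softplus_naive(x):
    # log(1 + e^x) computed directly: e^x overflows for large x
    return np.log1p(np.exp(x))

def softplus_stable(x):
    # log(1 + e^x) = logaddexp(0, x), computed without overflow
    return np.logaddexp(0.0, x)

def free_energy(v, W, v_bias, h_bias):
    # F(v) = -(v . b_v + sum_j softplus(W_j . v + c_j)), per sample in the batch
    wx_b = v @ W.T + h_bias
    return -(v @ v_bias + softplus_stable(wx_b).sum(axis=-1))

with np.errstate(over='ignore'):
    print(softplus_naive(np.array([1000.0])))   # [inf] -- overflow
print(softplus_stable(np.array([1000.0])))      # [1000.] -- stable
```

In PyTorch, `torch.nn.functional.softplus(wx_b)` gives the same stable behavior as `softplus_stable` here.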
Model 5
Model 5: Feed Forward Network on RBM Hidden Features (of Fashion MNIST Data)
from sklearn.metrics import accuracy_score, f1_score

CLASSIFIER = 'FNN'

if CLASSIFIER == 'LogisticRegression':
    experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-withrbm")
else:
    experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-withrbm")

class RBM(nn.Module):
    def __init__(self, n_visible=784, n_hidden=256, k=1):
        super(RBM, self).__init__()
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # Initialize weights and biases
        self.W = nn.Parameter(torch.randn(n_hidden, n_visible) * 0.1)
        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))
        self.k = k  # CD-k steps

    def sample_h(self, v):
        # Given visible v, sample hidden h
        p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias))  # p(h=1|v)
        h_sample = torch.bernoulli(p_h)  # sample Bernoulli
        return p_h, h_sample

    def sample_v(self, h):
        # Given hidden h, sample visible v
        p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))  # p(v=1|h)
        v_sample = torch.bernoulli(p_v)
        return p_v, v_sample

    def forward(self, v):
        # Perform k steps of contrastive divergence starting from v
        v_k = v.clone()
        for _ in range(self.k):
            _, h_k = self.sample_h(v_k)  # sample hidden from current visible
            _, v_k = self.sample_v(h_k)  # sample visible from hidden
        return v_k  # k-step reconstructed visible

    def free_energy(self, v):
        # Compute the visible bias term for each sample in the batch
        vbias_term = (v * self.v_bias).sum(dim=1)  # shape: [batch_size]
        # Compute the activation of the hidden units
        wx_b = F.linear(v, self.W, self.h_bias)  # shape: [batch_size, n_hidden]
        # Compute the hidden term
        hidden_term = torch.sum(torch.log1p(torch.exp(wx_b)), dim=1)  # shape: [batch_size]
        # Return the mean free energy over the batch
        return -(vbias_term + hidden_term).mean()

transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)

def objective(trial):
    num_rbm_epochs = trial.suggest_int("num_rbm_epochs", 5, 5)  # 24, 33)
    batch_size = trial.suggest_int("batch_size", 192, 1024)
    rbm_lr = trial.suggest_float("rbm_lr", 0.05, 0.1)
    rbm_hidden = trial.suggest_int("rbm_hidden", 384, 8192)
    mlflow.start_run(experiment_id=experiment.experiment_id)
    if CLASSIFIER != 'LogisticRegression':
        fnn_hidden = trial.suggest_int("fnn_hidden", 192, 384)
        fnn_lr = trial.suggest_float("fnn_lr", 0.0001, 0.0025)
        mlflow.log_param("fnn_hidden", fnn_hidden)
        mlflow.log_param("fnn_lr", fnn_lr)
    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5)  # 40, 60)
    mlflow.log_param("num_rbm_epochs", num_rbm_epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("rbm_lr", rbm_lr)
    mlflow.log_param("rbm_hidden", rbm_hidden)
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    # Instantiate RBM and optimizer
    device = torch.device("mps")
    rbm = RBM(n_visible=784, n_hidden=rbm_hidden, k=1).to(device)
    optimizer = torch.optim.SGD(rbm.parameters(), lr=rbm_lr)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    rbm_training_failed = False
    # Training loop (assuming train_loader yields batches of images and labels)
    for epoch in range(num_rbm_epochs):
        total_loss = 0.0
        for images, _ in train_loader:
            # Flatten images and binarize
            v0 = images.view(-1, 784).to(rbm.W.device)  # shape [batch_size, 784]
            v0 = torch.bernoulli(v0)  # sample binary input
            vk = rbm(v0)  # k-step CD reconstruction
            # Compute contrastive divergence loss (free energy difference)
            loss = rbm.free_energy(v0) - rbm.free_energy(vk)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1}: avg free-energy loss = {total_loss/len(train_loader):.4f}")
        if np.isnan(total_loss):
            rbm_training_failed = True
            break

    if rbm_training_failed:
        accuracy = 0.0
        macro_f1 = 0.0
        print("RBM training failed — returning 0.0 for accuracy and macro F1")
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1)
        mlflow.set_tag("status", "rbm_failed")  # Optional tag
        mlflow.end_run()
        return float(accuracy)
    else:
        rbm.eval()  # set evaluation mode in case any layers behave differently in training
        features_list = []
        labels_list = []
        for images, labels in train_loader:
            v = images.view(-1, 784).to(rbm.W.device)
            # (optionally binarize v here, or use raw normalized pixels)
            h_prob, h_sample = rbm.sample_h(v)  # get hidden activations
            features_list.append(h_prob.cpu().detach().numpy())
            labels_list.append(labels.numpy())
        train_features = np.concatenate(features_list)  # shape: [N_train, n_hidden]
        train_labels = np.concatenate(labels_list)

        # Convert pre-extracted training features and labels to tensors and create a DataLoader
        train_features_tensor = torch.tensor(train_features, dtype=torch.float32)
        train_labels_tensor = torch.tensor(train_labels, dtype=torch.long)
        train_feature_dataset = torch.utils.data.TensorDataset(train_features_tensor, train_labels_tensor)
        train_feature_loader = torch.utils.data.DataLoader(train_feature_dataset, batch_size=batch_size, shuffle=True)

        if CLASSIFIER == 'LogisticRegression':
            # Optuna tuning, same as logistic regression without RBM features
            lr_C = trial.suggest_float("lr_C", 0.01, 10.0, log=True)
            mlflow.log_param("lr_C", lr_C)  # Log the chosen C value
            classifier = LogisticRegression(max_iter=num_classifier_epochs, C=lr_C, solver="saga")
            classifier.fit(train_features, train_labels)
        else:
            classifier = nn.Sequential(
                nn.Linear(rbm.n_hidden, fnn_hidden),
                nn.ReLU(),
                nn.Linear(fnn_hidden, 10)
            )
            # Move classifier to the same device as the RBM
            classifier = classifier.to(device)
            criterion = nn.CrossEntropyLoss()
            classifier_optimizer = torch.optim.Adam(classifier.parameters(), lr=fnn_lr)
            classifier.train()
            for epoch in range(num_classifier_epochs):
                running_loss = 0.0
                for features, labels in train_feature_loader:
                    features = features.to(device)
                    labels = labels.to(device)
                    # Forward pass through classifier
                    outputs = classifier(features)
                    loss = criterion(outputs, labels)
                    # Backpropagation and optimization
                    classifier_optimizer.zero_grad()
                    loss.backward()
                    classifier_optimizer.step()
                    running_loss += loss.item()
                avg_loss = running_loss / len(train_feature_loader)
                print(f"Classifier Epoch {epoch+1}: loss = {avg_loss:.4f}")

        # Evaluate the classifier on test data.
        # Here we extract features from the RBM for each test image.
        if CLASSIFIER != 'LogisticRegression':
            classifier.eval()
        correct = 0
        total = 0
        features_list = []
        labels_list = []
        with torch.no_grad():
            for images, labels in test_loader:
                v = images.view(-1, 784).to(device)
                # Extract hidden activations; you can use either h_prob or h_sample.
                h_prob, _ = rbm.sample_h(v)
                if CLASSIFIER == 'LogisticRegression':
                    features_list.append(h_prob.cpu().detach().numpy())
                    labels_list.append(labels.numpy())
                else:
                    outputs = classifier(h_prob)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted.cpu() == labels).sum().item()

        if CLASSIFIER == 'LogisticRegression':
            test_features = np.concatenate(features_list)
            test_labels = np.concatenate(labels_list)
            predictions = classifier.predict(test_features)
            accuracy = accuracy_score(test_labels, predictions) * 100
            macro_f1 = f1_score(test_labels, predictions, average="macro")
        else:
            accuracy = 100 * correct / total
            all_preds = []
            all_labels = []
            classifier.eval()
            with torch.no_grad():
                for images, labels in test_loader:
                    v = images.view(-1, 784).to(device)
                    h_prob, _ = rbm.sample_h(v)
                    outputs = classifier(h_prob)
                    _, predicted = torch.max(outputs.data, 1)
                    all_preds.extend(predicted.cpu().numpy())
                    all_labels.extend(labels.numpy())
            macro_f1 = f1_score(all_labels, all_preds, average="macro")

        print(f"Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}")
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1)
        mlflow.end_run()
        return float(accuracy if accuracy is not None else 0.0)

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1)  # n_trials set to 1 for quick rendering
    print(study.best_params)
    print(study.best_value)
    print(study.best_trial)
Epoch 1: avg free-energy loss = 83.8848
Epoch 2: avg free-energy loss = 21.5381
Epoch 3: avg free-energy loss = 15.0502
Epoch 4: avg free-energy loss = 12.2593
Epoch 5: avg free-energy loss = 10.6663
Classifier Epoch 1: loss = 0.7195
Classifier Epoch 2: loss = 0.4814
Classifier Epoch 3: loss = 0.4437
Classifier Epoch 4: loss = 0.4223
Classifier Epoch 5: loss = 0.4037
What the plot shows: The highest accuracies cluster between 2000 and 4000 hidden units in the RBM, with an outlier at 3764 hidden units. This suggests that too few hidden units lack the capacity needed to represent the data, while too many may cause overfitting, resulting in poor generalization by the FNN classifier that receives the RBM hidden features.
Test Accuracy by FNN Hidden Units
What the plot shows: Surprisingly, the number of hidden units in the FNN does not show a strong correlation with test accuracy; all values tested yield similar performance. This suggests the FNN learns sufficiently from the RBM features, and additional neurons do not significantly improve generalization.
| Model | Optuna Best Trial MLflow Test Accuracy (%) | Macro F1 Score |
|---|---|---|
| Logistic Regression | 84.71 | 0.846 |
| Feed Forward Network | 88.06 | 0.879 |
| Convolutional Neural Network | 91.29 | 0.913 |
| Logistic Regression (on RBM Hidden Features) | 87.14 | 0.871 |
| Feed Forward Network (on RBM Hidden Features) | 86.95 | 0.869 |
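For reference, the macro F1 scores in the table average the per-class F1 scores with equal weight, so every class counts equally regardless of how often it is predicted. A minimal pure-Python sketch of the metric (`sklearn.metrics.f1_score(..., average="macro")`, used throughout the code above, computes the same quantity):

```python
def macro_f1(y_true, y_pred, n_classes):
    # Average the per-class F1 scores with equal weight per class
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / n_classes

# Toy 3-class example: per-class F1 scores are 0.5, 0.8, and 0.6667
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred, 3), 4))  # 0.6556
```

Because Fashion MNIST's test set is balanced across its ten classes, macro F1 and accuracy track each other closely here, as the table reflects.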
Conclusion
The CNN clearly outperforms the other models. Logistic regression, which typically performs well on binary classification tasks, underperforms on the multiclass Fashion MNIST task; it is improved by first using a Restricted Boltzmann Machine to extract hidden features from the input data prior to classification. The feed-forward network, by contrast, is not improved by the use of an RBM. These findings illustrate the progress in machine and deep learning: more advanced neural networks trained on raw pixels can outperform models that use RBM hidden features.
Restricted Boltzmann Machines are no longer considered state-of-the-art for machine learning tasks. While contrastive divergence made training RBMs tractable, supervised training of deep feed-forward and convolutional networks with backpropagation proved more effective and came to dominate the field. This was largely the result of overcoming exploding and vanishing gradients through techniques such as batch normalization, dropout, and better weight initialization.
However, for the student of machine learning, studying RBMs remains valuable for understanding the foundations of unsupervised learning and energy-based models. Popular generative AI models today, such as Stable Diffusion, build on related score- and energy-based ideas. The mechanics of RBM training, such as Gibbs sampling, and the probabilistic nature of the model demonstrate how probability theory and concepts like Markov chains and Boltzmann distributions are applied in machine learning.
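The block Gibbs sampling mentioned here can be illustrated in a few lines. The sketch below uses an untrained, randomly initialized toy RBM in NumPy (illustrative only, not the model from this project): each step samples all hidden units at once given the visible units, then all visible units given the hidden units, which is exactly the alternating block update the RBM's bipartite structure permits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative RBM: 6 visible and 4 hidden units, random untrained weights
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    # Block update: sample every hidden unit given v, then every visible unit given h
    p_h = sigmoid(W @ v + b_h)                      # p(h=1|v)
    h = (rng.random(n_hidden) < p_h).astype(float)  # Bernoulli sample
    p_v = sigmoid(W.T @ h + b_v)                    # p(v=1|h)
    return (rng.random(n_visible) < p_v).astype(float)

# Run the Markov chain from a random binary state; each step is one full block update
v = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(100):
    v = gibbs_step(v)
print(v)  # a binary visible configuration drawn from (approximately) the model distribution
```

With a trained weight matrix, long runs of this chain would produce samples resembling the training data; contrastive divergence short-circuits the chain to just k steps starting from a data point.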
References
Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. “Optuna: A Next-Generation Hyperparameter Optimization Framework.” In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–31.
Aslan, Narin, Sengul Dogan, and Gonca Ozmen Koca. 2023. “Automated Classification of Brain Diseases Using the Restricted Boltzmann Machine and the Generative Adversarial Network.” Engineering Applications of Artificial Intelligence 126: 106794.
Fiore, Ugo, Francesco Palmieri, Aniello Castiglione, and Alfredo De Santis. 2013. “Network Anomaly Detection with the Restricted Boltzmann Machine.” Neurocomputing 122: 13–23.
Fischer, Asja, and Christian Igel. 2012. “An Introduction to Restricted Boltzmann Machines.” In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 17th Iberoamerican Congress, CIARP 2012, Buenos Aires, Argentina, September 3-6, 2012. Proceedings 17, 14–36. Springer.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Hinton, Geoffrey E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14 (8): 1771–1800.
Melko, Roger G, Giuseppe Carleo, Juan Carrasquilla, and J Ignacio Cirac. 2019. “Restricted Boltzmann Machines in Quantum Physics.” Nature Physics 15 (9): 887–92.
Ning, Lin, Randall Pittman, and Xipeng Shen. 2018. “LCD: A Fast Contrastive Divergence Based Algorithm for Restricted Boltzmann Machine.” Neural Networks 108: 399–410.
Oh, Sangchul, Abdelkader Baggag, and Hyunchul Nha. 2020. “Entropy, Free Energy, and Work of Restricted Boltzmann Machines.” Entropy 22 (5): 538.
Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M Ingersoll. 2002. “An Introduction to Logistic Regression Analysis and Reporting.” The Journal of Educational Research 96 (1): 3–14.
Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. 2007. “Restricted Boltzmann Machines for Collaborative Filtering.” In Proceedings of the 24th International Conference on Machine Learning, 791–98.
Sazlı, Murat H. 2006. “A Brief Review of Feed-Forward Neural Networks.” Communications Faculty of Sciences University of Ankara Series A2-A3 Physical Sciences and Engineering 50 (01).
Smolensky, Paul et al. 1986. “Information Processing in Dynamical Systems: Foundations of Harmony Theory.”
Xiao, Han, Kashif Rasul, and Roland Vollgraf. 2017. “Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms.” August 28, 2017. https://arxiv.org/abs/cs.LG/1708.07747.
Zaharia, Matei, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, et al. 2018. “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Eng. Bull. 41 (4): 39–45.